Cross-dataset Clustering: Revealing Corresponding Themes across Multiple Corpora

نویسندگان

  • Ido Dagan
  • Zvika Marx
  • Eli Shamir
چکیده

We present a method for identifying corresponding themes across several corpora that are focused on related, but distinct, domains. This task is approached through simultaneous clustering of keyword sets extracted from the analyzed corpora. Our algorithm extends the informationbottleneck soft clustering method for a suitable setting consisting of several datasets. Experimentation with topical corpora reveals similar aspects of three distinct religions. The evaluation is by way of comparison to clusters constructed manually by an expert.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Generalized Framework for Revealing Analogous Themes across Related Topics

This work addresses the task of identifying thematic correspondences across subcorpora focused on different topics. We introduce an unsupervised algorithmic framework based on distributional data clustering, which generalizes previous initial works on this task. The empirical results reveal interesting commonalities of different religions. We evaluate the results through measuring the overlap o...

متن کامل

Document Representation Methods for Clustering Bilingual Documents

Globalization places people in a multilingual environment. There is a growing number of users to access and share information in several languages for public or private purpose. In order to deliver relevant information in different languages, efficient multilingual documents management is worthy of study. Generally, classification and clustering are two typical methods for documents management....

متن کامل

A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection

In this paper we describe our effort to create a dataset for the evaluation of cross-language textual similarity detection. We present preexisting corpora and their limits and we explain the various gathered resources to overcome these limits and build our enriched dataset. The proposed dataset is multilingual, includes cross-language alignment for different granularities (from chunk to documen...

متن کامل

Coupled Clustering: a Method for Detecting Structural Correspondence

This paper proposes a new paradigm and a computational framework for revealing equivalencies (analogies) between sub-structures of distinct composite systems that are initially represented by unstructured data sets. For this purpose, we introduce and investigate a variant of traditional data clustering, termed coupled clustering, which outputs a configuration of corresponding subsets of two suc...

متن کامل

Schemas of Clustering

Data mining techniques, such as clustering, have become a mainstay in many applications such as bioinformatics, geographic information systems, and marketing. Over the last decade, due to new demands posed by these applications, clustering techniques have been significantly adapted and extended. One such extension is the idea of finding clusters in a dataset that preserve information about some...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002